Skip to content

Conversation

@varuntumbe
Copy link
Contributor

@varuntumbe varuntumbe commented Jun 15, 2025

What is the problem ?

When we have more then one static route over the same nexthop with BFD, session goes down and gets cleared if we remove one of the static route.

What is the Root Cause ?

  1. When we create static route over a nexthop with BFD, bfdd_dest_register routine will be called where for the first flow, we create new pcn object ( pcn_new ), we increment the refcount and notify the client via ptm_bfd_notify

But for the subsequent addition of static route over the same nh, means pcn_lookup would give the same existing pcn skipping the increment of refcount.

so essentially, refcount remains at value 1, even if you add any number of static route over the nh.

  1. When we remove one of the static, bfdd_dest_deregister will called where we do pcn_free ( where we decrement the refcount ) and session delete ( _ptm_bfd_session_del )

What is the fix ?

Fix has 2 parts

  1. Incrementing the refcount properly ( early before pcn_lookup) for every static route addition. This makes sures that refcount gets updated properly for every addition of static route

  2. Now when we start removing the static routes one by one, we decrement the refcount ( in bfdd_dest_deregister ) and will call the pcn_free only when refcount becomes 0 ( essentially deleting the session only when there are no static routes points to nh )

Closes #19014

@frrbot frrbot bot added the bfd label Jun 15, 2025
Copy link
Member

@ton31337 ton31337 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please update the commit also (instead of TBD).

* and the worst case is the refcount will detain us.
*/
_ptm_bfd_session_del(bs, BD_NEIGHBOR_DOWN);
//_ptm_bfd_session_del(bs, BD_NEIGHBOR_DOWN);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Leftovers?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Here,

This func call ( _ptm_bfd_session_del ) looks like it has been added for a specific scenario rather to address specific bug, not sure how to handle this here.

Basically this call should be deleted, why ?

As we remove the static routes one by one, refcount gets decremented one by one and once it reaches 0, we call the pcn_free where we clean up the pcn related data structures and call _ptm_bfd_session_del where it notifies the peer about the session down and frees up BFD session related data structures.

So according to my opinion, this call is not needed.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Then let's drop it instead of commenting.

struct bfd_session *bs)
{
struct ptm_client_notification *pcn;
/* Lets increment the refcount in the beginning itself regardless of its a new pcn or old*/
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What is the difference here or how was it before?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Earlier flow :
bfdd_dest_register
pcn_new
==> here we try to find whether there is already a existing pcn for the particular bsd session, if found, we used to return without updating the ref_count ( refcount updation was later pcn_lookup ) which mess up the refcount logic.

===> if we add multiple static route over the same bfd peer nh, for the subsequent addition, pcn_lookup will give the existing pcn which result in NOT updating the refcount

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Now refcount incrementation has been added before the pcn_lookup which makes the refcount increment properly

@@ -0,0 +1,29 @@
hostname rt1
log file /tmp/topotests/bfd_ospf_topo1.test_bfd_ospf_topo1/rt1/mgmtd.log
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Drop every log file statement.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed

log file /tmp/topotests/bfd_ospf_topo1.test_bfd_ospf_topo1/rt1/bfdd.log
no service integrated-vtysh-config
!
password 1
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Drop the password statement.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed

log file /tmp/topotests/bfd_ospf_topo1.test_bfd_ospf_topo1/rt1/ospf6d.log
log file /tmp/topotests/bfd_ospf_topo1.test_bfd_ospf_topo1/rt1/staticd.log
log file /tmp/topotests/bfd_ospf_topo1.test_bfd_ospf_topo1/rt1/bfdd.log
no service integrated-vtysh-config
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why this is needed?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Removed

!
password 1
!
ip route 1.1.1.2/32 10.0.1.2 bfd
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No Cloudflare IPs, please. Use private ranges or documentation ranges.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Removed

exit
!
interface lo
ip address 2.2.2.2/32
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ditto.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Removed

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please, use private or documentation ranges.

# SPDX-License-Identifier: ISC

#
# test_bfd_static_vrf.py
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Bad naming.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Changed the name

# test_bfd_static_vrf.py
# Part of NetDEF Topology Tests
#
# Copyright 2025 6WIND S.A.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Are you behind 6wind?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Changed the copyright

@varuntumbe varuntumbe force-pushed the dev/bfd_static_route_fix branch from c2bc60d to 88e9541 Compare June 17, 2025 16:28
…utes over same nh

Adding the topotest which verifies whether BFD session stays up as long as there is one static route pointing to nexthop bfd peer

Signed-off-by: Varun Hegde <[email protected]>
This Fix makes the BFD session to stay up until the last static route going over the gateway ip which is tracked by bfd.

Signed-off-by: Varun Hegde <[email protected]>
@varuntumbe varuntumbe force-pushed the dev/bfd_static_route_fix branch from 88e9541 to 93ed0ce Compare June 17, 2025 16:34
@varuntumbe
Copy link
Contributor Author

======> cmd : show bfd peers json

[
{
"multihop":false,
"peer":"2001:db8:1::1",
"local":"2001:db8:1::2",
"vrf":"default",
"interface":"r2-eth0",
"id":2411218638,
"remote-id":38497228,
"passive-mode":false,
"log-session-changes":false,
"status":"up",
"uptime":4,
"diagnostic":"ok",
"remote-diagnostic":"ok",
"type":"dynamic",
"receive-interval":600,
"transmit-interval":600,
"echo-receive-interval":50,
"echo-transmit-interval":0,
"detect-multiplier":3,
"remote-receive-interval":600,
"remote-transmit-interval":600,
"remote-echo-receive-interval":50,
"remote-detect-multiplier":3,
"rtt-min":0,
"rtt-avg":0,
"rtt-max":0
},
{
"multihop":false,
"peer":"2001:db8:2::1",
"vrf":"default",
"id":3088557958,
"remote-id":2847372379,
"passive-mode":false,
"log-session-changes":false,
"status":"up",
"uptime":3,
"diagnostic":"ok",
"remote-diagnostic":"ok",
"type":"dynamic",
"receive-interval":2000,
"transmit-interval":2000,
"echo-receive-interval":50,
"echo-transmit-interval":0,
"detect-multiplier":3,
"remote-receive-interval":2000,
"remote-transmit-interval":2000,
"remote-echo-receive-interval":50,
"remote-detect-multiplier":3,
"rtt-min":0,
"rtt-avg":0,
"rtt-max":0
}
]

================> from the router

[
{
"detect-multiplier": 3,
"diagnostic": "ok",
"echo-receive-interval": 50,
"echo-transmit-interval": 0,
"id": "",
"interface": "r2-eth0",
"local": "2001:db8:1::2",
"multihop": false,
"passive-mode": false,
"peer": "2001:db8:1::1",
"receive-interval": 600,
"remote-detect-multiplier": 3,
"remote-diagnostic": "ok",
"remote-echo-receive-interval": 50,
"remote-id": "
",
"remote-receive-interval": 600,
"remote-transmit-interval": 600,
"status": "up",
"uptime": "",
"transmit-interval": 600
},
{
"detect-multiplier": 3,
"diagnostic": "ok",
"echo-receive-interval": 50,
"echo-transmit-interval": 0,
"id": "
",
"interface": "r2-eth1",
"local": "2001:db8:2::2",
"multihop": false,
"passive-mode": false,
"peer": "2001:db8:2::1",
"receive-interval": 2000,
"remote-detect-multiplier": 3,
"remote-diagnostic": "ok",
"remote-echo-receive-interval": 50,
"remote-id": "",
"remote-receive-interval": 2000,
"remote-transmit-interval": 2000,
"status": "up",
"uptime": "
",
"transmit-interval": 2000
}
]

===============> expected o/p from the testcase file

one of the topotest failing consistantly ( bfd_topo3 ) where expected does not match the received o/p from the router.

When I checked, its failing because its not able to find "local" and/or "interface" fields ( which are in expected files but not in router gen o/p )

I dont see how my changes affected this failure. Any idea @ton31337 ?

Thanks,
Varun

!

interface lo
ip address 1.1.1.1/32
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No Cloudflare, please, use private or documentation ranges.

exit
!
interface lo
ip address 2.2.2.2/32
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please, use private or documentation ranges.

#
# test_bfd_static_topo1.py
#
# Copyright (c) 2024 by Varun Hegde
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

My calendar now shows it's 2025.

* and the worst case is the refcount will detain us.
*/
_ptm_bfd_session_del(bs, BD_NEIGHBOR_DOWN);
//_ptm_bfd_session_del(bs, BD_NEIGHBOR_DOWN);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Then let's drop it instead of commenting.

@thierrydevriendt
Copy link

Hello,

Could you let me know when this fix can be merged?

Thanks.

@ton31337
Copy link
Member

@varuntumbe are you still working on this?

@varuntumbe
Copy link
Contributor Author

@varuntumbe are you still working on this?

Hi @ton31337 ,

This solution is fundamentally wrong ( as I found out when one of the bfd topo test failed ). This needs a rework. I tried reworking this couple of months back but was not able to code it up to a proper solution. I am not working on this now. Apologies.

thanks

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Removing the bfd from a single static route removes the bfd session even if there are more static routes using same gateway with bfd

3 participants